We present a framework for performing efficient regression in general metric spaces. Roughly
speaking, our regressor predicts the value at a new point by computing a Lipschitz extension
(the smoothest function consistent with the observed data) while performing an optimized
structural risk minimization to avoid overfitting. The offline (learning) and online (inference)
stages can be solved by convex programming, but this naive approach has runtime complexity
$O(n^3)$, which is prohibitive for large datasets. We design instead an algorithm that is fast when
the doubling dimension, which measures the "intrinsic" dimensionality of the metric space,
is low.

We use the doubling dimension multiple times; first, on the statistical front, to bound the
fat-shattering dimension of the class of Lipschitz functions (and obtain risk bounds); and second,
on the computational front, to quickly compute a hypothesis function and a prediction based
on Lipschitz extension. Our resulting regressor is both asymptotically strongly consistent and
comes with finite-sample risk bounds, while making minimal structural and noise assumptions.

1 Introduction 

The classical problem of estimating a continuous-valued function from noisy observations, known
as regression, is of central importance in statistical theory with a broad range of applications, see
[PR83, BFOS84, Nad89, Har90, GKKW02]. When no parametric assumptions about the
target function are made, the regression problem is termed nonparametric. Informally, the main
objective in the study of nonparametric regression is to understand the relationship between the
regularity conditions that a function class might satisfy (e.g., Lipschitz or Hölder continuity, or
sparsity in some representation) and the minimax risk convergence rates [Tsy04, Was06]. A further
consideration is the computational efficiency of constructing the regression function.

The general (univariate) nonparametric regression problem may be stated as follows. Let $(\mathcal{X}, \rho)$
be a metric space, namely $\mathcal{X}$ is a set of points and $\rho$ a distance function, and let $\mathcal{H}$ be a collection
of functions ("hypotheses") $h : \mathcal{X} \to [0, 1]$. (Although in general, $h$ is not explicitly restricted to
have bounded range, typical assumptions on the diameter of $\mathcal{X}$ and the noise distribution amount
to an effective truncation.) The space $\mathcal{X} \times [0, 1]$ is endowed with some fixed, unknown probability
distribution $\mu$, and the learner observes $n$ i.i.d. draws $(X_i, Y_i) \sim \mu$. The learner then seeks to fit
the observed data with some hypothesis $h \in \mathcal{H}$ so as to minimize the risk, usually defined as the
expected loss $\mathbb{E}|h(X) - Y|^q$ for $(X, Y) \sim \mu$ and some $q \ge 1$.

Two limiting assumptions have traditionally been made when approaching this problem: (i) the 
space $\mathcal{X}$ is Euclidean and (ii) $Y_i = h^*(X_i) + \xi_i$, where $h^*$ is the target function and $\xi_i$ is an i.i.d.
noise process, often taken to be Gaussian. Although our understanding of nonparametric regression 
under these assumptions is quite elaborate, little is known about nonparametric regression in the 
absence of either assumption. 

The present work takes a step towards bridging this gap. Specifically, we consider nonparametric
regression in an arbitrary metric space, while making no assumptions on the distribution of the data
or the noise. Our results rely on the structure of the metric space only to the extent of assuming
that the metric space has a low "intrinsic" dimensionality. Specifically, we employ the doubling
dimension of $\mathcal{X}$, denoted $\mathrm{ddim}(\mathcal{X})$, which was introduced by [GKL03] based on earlier work of
[Ass83, Cla99], and has been since utilized in several algorithmic contexts, including networking,
combinatorial optimization, and similarity search, see e.g. [KSW09, Tal04, KL04, BKL06, HM06,
CG06, Cla06]. (A formal definition and prevailing examples appear in Section 2.) Following the
work in [GKK10] on classification problems, our risk bounds and algorithmic runtime bounds are
stated in terms of the doubling dimension of the ambient space and the Lipschitz constant of the
regression hypothesis, although neither of these quantities need be known in advance.

Our results. We consider two kinds of risk: $L_1$ (mean absolute) and $L_2$ (mean square). More
precisely, for $q \in \{1, 2\}$ we associate to each hypothesis $h \in \mathcal{H}$ the empirical $L_q$-risk

$$R_n(h) = R_n(h, q) = \frac{1}{n} \sum_{i=1}^{n} |h(X_i) - Y_i|^q \qquad (1)$$

and the (expected) $L_q$-risk

$$R(h) = R(h, q) = \mathbb{E}|h(X) - Y|^q = \int |h(x) - y|^q \, \mu(dx, dy). \qquad (2)$$

It is well-known that $h(x) = \mathbb{M}[Y \mid X = x]$ (where $\mathbb{M}$ is the median) minimizes $R(\cdot, 1)$ over
all integrable $h \in [0, 1]^{\mathcal{X}}$ and $h(x) = \mathbb{E}[Y \mid X = x]$ minimizes $R(\cdot, 2)$. However, these expressions
are of little use as neither is computable without knowledge of $\mu$. To circumvent this difficulty, we
minimize the empirical $L_q$-risk and assert that the latter is a good approximation of the expected
risk, provided $\mathcal{H}$ meets certain regularity conditions.
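
As a concrete reading of definition (1), the following minimal sketch computes the empirical $L_q$-risk of a hypothesis on a finite sample. The function name `empirical_risk` and the toy sample are illustrative assumptions and do not come from the paper.

```python
from typing import Callable, Sequence, Tuple


def empirical_risk(h: Callable[[float], float],
                   sample: Sequence[Tuple[float, float]],
                   q: int) -> float:
    """Average of |h(x) - y|^q over the sample, for q in {1, 2}."""
    if q not in (1, 2):
        raise ValueError("q must be 1 (mean absolute) or 2 (mean square)")
    if not sample:
        raise ValueError("sample must be non-empty")
    return sum(abs(h(x) - y) ** q for x, y in sample) / len(sample)


# Toy data: the identity hypothesis misses every label by about 0.1.
toy_sample = [(0.0, 0.1), (0.5, 0.4), (1.0, 0.9)]
risk_l1 = empirical_risk(lambda x: x, toy_sample, q=1)
risk_l2 = empirical_risk(lambda x: x, toy_sample, q=2)
```

The expected risk (2) cannot be computed this way, since it integrates against the unknown distribution; only its empirical counterpart is available to the learner.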

To this end, we define the following random variable, termed uniform deviation:

$$\Delta_n(\mathcal{H}) = \Delta_n(\mathcal{H}, q) = \sup_{h \in \mathcal{H}} |R_n(h) - R(h)|. \qquad (3)$$

It is immediate that

$$R(h) \le R_n(h) + \Delta_n(\mathcal{H}) \qquad (4)$$

holds for all $h \in \mathcal{H}$ (i.e., the expected risk of any hypothesis does not exceed its empirical risk by
much), and it can further be shown [BBL05] that

$$R(\hat{h}) \le R(h^*) + 2\Delta_n(\mathcal{H}), \qquad (5)$$

where $\hat{h} \in \mathcal{H}$ is a minimizer of the empirical risk and $h^* \in \mathcal{H}$ is a minimizer of the expected risk
(i.e., the expected risk of $\hat{h}$ does not exceed the risk of the best admissible hypothesis by much).

Our contribution is twofold: statistical and computational. The algorithm in Theorem 4.1
computes an $\eta$-additive approximation to the empirical risk minimizer in time $\eta^{-O(\mathrm{ddim}(\mathcal{X}))} n \log^3 n$.
By Theorem 5.1, this hypothesis can be evaluated on new points in time $\eta^{-O(\mathrm{ddim}(\mathcal{X}))} \log n$. The
expected risk of this hypothesis decays as the empirical risk plus $1/\mathrm{poly}(n)$, as proved in Theorem
3.6. Although our bounds explicitly depend on the doubling dimension, the latter may be efficiently
estimated from the data [GK10].



Related work. There are many excellent references for classical Euclidean nonparametric regression
assuming i.i.d. noise, see for example [GKKW02, Har90, BFOS84, Nad89, PR83, DGL96]. For
metric regression, a simple risk bound follows from classic VC theory via the pseudo-dimension, see
e.g. [Pol84, Vap95, Ney06]. However, the pseudo-dimension of many non-trivial function classes,
including Lipschitz functions, grows linearly with the sample size, ultimately yielding a vacuous
bound. An approach to nonparametric regression based on empirical risk minimization, though only
for the Euclidean case, may already be found in [LZ95]; see the comprehensive historical overview
therein. Indeed, Theorem 5.2 in [GKKW02] gives a kernel regressor for Lipschitz functions that
achieves the minimax rate. Note, however, that (a) the setting is restricted to Euclidean spaces; and
(b) the cost of evaluating the hypothesis at a new point grows linearly with the sample size (while
our complexity is roughly logarithmic). As noted above, another feature of our approach is its
ability to give efficiently computable finite-sample bounds, as opposed to the asymptotic convergence
rates obtained in [GKKW02, LZ95] and elsewhere.

More recently, risk bounds in terms of doubling dimension and Lipschitz constant were given in
[Kpo09], assuming an additive noise model, and hence these results are incomparable to ours; for
instance, these risk bounds worsen with an increasingly smooth regression function. Following up,
a regression technique based on random partition trees was proposed in [KD11], based on mappings
between Euclidean spaces and assuming an additive noise model. Another recent advance in
nonparametric regression was Rodeo [LW08], which escapes the curse of dimensionality by adapting to
the sparsity of the regression function.

Our work was inspired by the paper of von Luxburg and Bousquet [vLB04], who were apparently
the first to make the connection between Lipschitz classifiers in metric spaces and large-margin
hyperplanes in Banach spaces, thereby providing a novel generalization bound for nearest-neighbor
classifiers. They developed a powerful statistical framework whose core idea may be summarized
as follows: to predict the behavior at new points, find the smoothest function consistent with the
training sample. Their work raises natural algorithmic questions like how to estimate the risk for
a given input, how to perform model selection (Structural Risk Minimization) to avoid overfitting,
and how to perform the learning and prediction quickly. Follow-up work [GKK10] leveraged the
doubling dimension simultaneously for statistical and computational efficiency, to design an efficient
classifier for doubling spaces. Its key feature is an efficient algorithm to find the optimal balance
between the empirical risk and the penalty term for a given input.

Minh and Hoffman [MH04] take the idea in [vLB04] in a more algebraic direction, establishing
a representer theorem for Lipschitz functions on compact metric spaces.



Paper outline. We start by defining the basic concepts in Section 2. The information-theoretic
bulk of the paper is in Section 3, where risk bounds for Lipschitz functions are given via
fat-shattering dimension estimates. Our efficient model selection procedure is described in Section
4 and the local prediction algorithm is described in Section 5. Finally, we establish the strong,
universal consistency of our regression estimate in Section 6.



2 Technical background 

We use standard notation and definitions throughout. 



Metric spaces, Lipschitz constants. A metric $d$ on a set $\mathcal{X}$ is a positive symmetric function
satisfying the triangle inequality $d(x, y) \le d(x, z) + d(z, y)$; together the two comprise the metric
space $(\mathcal{X}, d)$. The diameter of a set $A \subseteq \mathcal{X}$ is defined by $\mathrm{diam}(A) = \sup_{x, y \in A} d(x, y)$. In this paper,
we always take $\mathrm{diam}(\mathcal{X}) = 1$. The Lipschitz constant of a function $f : \mathcal{X} \to \mathbb{R}$, denoted $\|f\|_{\mathrm{Lip}}$, is
defined to be the smallest $L \ge 0$ that makes $|f(x) - f(y)| \le L\, d(x, y)$ hold for all $x, y \in \mathcal{X}$.
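
On a finite point set the Lipschitz constant is just the largest slope over all pairs, which the following sketch computes by brute force. The helper name `lipschitz_constant` and the example on the real line are illustrative assumptions.

```python
from itertools import combinations
from typing import Callable, Sequence, TypeVar

P = TypeVar("P")


def lipschitz_constant(points: Sequence[P],
                       values: Sequence[float],
                       dist: Callable[[P, P], float]) -> float:
    """Smallest L with |f(x) - f(y)| <= L * dist(x, y) over all sample pairs."""
    best = 0.0
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])
        gap = abs(values[i] - values[j])
        if d == 0:
            # Two copies of one point with different values: no finite L works.
            if gap > 0:
                return float("inf")
            continue
        best = max(best, gap / d)
    return best


# Three points on the unit interval with the usual distance.
L_hat = lipschitz_constant([0.0, 0.5, 1.0], [0.0, 0.25, 1.0],
                           lambda a, b: abs(a - b))
```

This takes quadratic time in the number of points, which is the same pairwise blow-up that the spanner construction later in the paper is designed to avoid.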

Doubling dimension. For a metric $(\mathcal{X}, \rho)$, let $\lambda > 0$ be the smallest value such that every ball in
$\mathcal{X}$ can be covered by $\lambda$ balls of half the radius. The doubling dimension of $\mathcal{X}$ is $\mathrm{ddim}(\mathcal{X}) = \log_2 \lambda$.
A metric (or family of metrics) is called doubling if its doubling dimension is uniformly bounded.
Note that while a low Euclidean dimension implies a low doubling dimension (Euclidean metrics of
dimension $d$ have doubling dimension $O(d)$), low doubling dimension is strictly more general than
low Euclidean dimension.

Doubling metrics occur naturally in many data analysis applications, including for instance
the geodesic distance of a low-dimensional manifold residing in a possibly high-dimensional space
(assuming mild conditions, e.g. on curvature). Some concrete examples for doubling metrics include:
(i) $\mathbb{R}^d$ for fixed $d$ equipped with an arbitrary norm, e.g. $\ell_p$ or a mix between $\ell_1$ and $\ell_2$; (ii) the
planar earthmover metric between point sets of size $k = O(1)$ [GKK10, Section 6]; (iii) the $n$-cycle
graph and its continuous version, the quotient $\mathbb{R}/\mathbb{Z}$, and similarly bounded-dimensional tori. In
addition, various networks that arise in practice, such as peer-to-peer communication networks and
online social networks, can be modeled reasonably well by a doubling metric space.

The following packing property can be demonstrated via repeated applications of the doubling
property (see, for example [KL04]):

Lemma 2.1. Let $\mathcal{X}$ be a metric space and suppose that $S \subseteq \mathcal{X}$ has a minimum interpoint distance
of at least $\alpha$. Then

$$|S| \le \left\lceil \frac{2\,\mathrm{diam}(S)}{\alpha} \right\rceil^{\mathrm{ddim}(\mathcal{X})}.$$

Note that there is no loss of generality in assuming $\mathrm{diam}(\mathcal{X}) = 1$, since we can always scale the
distances and Lipschitz constants to ensure this.
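
Sets with a minimum interpoint distance, as in the packing lemma, are typically obtained as nets. The sketch below is a simple greedy construction, offered as an illustration rather than as the paper's procedure: it keeps a point only if it is at distance at least `alpha` from every point kept so far, so the result is `alpha`-separated and every discarded point lies within `alpha` of a kept one. The name `greedy_net` is hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

P = TypeVar("P")


def greedy_net(points: Sequence[P],
               dist: Callable[[P, P], float],
               alpha: float) -> List[P]:
    """Greedy alpha-net: pairwise distances >= alpha, covering radius < alpha."""
    net: List[P] = []
    for p in points:
        if all(dist(p, z) >= alpha for z in net):
            net.append(p)
    return net


# Eleven equally spaced points on [0, 1] with the usual distance.
grid = [i / 10 for i in range(11)]
net = greedy_net(grid, lambda a, b: abs(a - b), alpha=0.25)
```

The size of such a net is exactly the kind of quantity that the packing property controls in terms of the doubling dimension.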



Graph spanner. A graph $H$ is a $(1 + \delta)$-stretch spanner for graph $G$ if $H$ is a subgraph of $G$
that contains all nodes of $G$ (but not all edges), and $d_H(u, v) \le (1 + \delta)\, d_G(u, v)$ for all $u, v \in G$,
where $d_G(u, v)$ ($d_H(u, v)$) denotes the shortest path distance between $u$ and $v$ in $G$ ($H$). If spanner
$H$ achieves this bound even when its distance function is restricted to paths in $H$ of $k$ edges or
fewer, then $H$ is a $(1 + \delta)$-stretch $k$-hop spanner for $G$. The definitions above apply also when the
edges of $G$ have positive lengths.
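
To make the stretch definition concrete, the sketch below measures the worst-case ratio $d_H(u,v)/d_G(u,v)$ of a subgraph using all-pairs shortest paths (Floyd-Warshall). It only checks the definition on a small example; it is not the spanner construction used in the paper, and the function names are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple

Edge = Tuple[int, int, float]  # (u, v, positive length), undirected


def shortest_paths(n: int, edges: Sequence[Edge]) -> List[List[float]]:
    """All-pairs shortest path lengths on nodes 0..n-1 (Floyd-Warshall)."""
    d = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        d[i][i] = 0.0
    for u, v, w in edges:
        d[u][v] = min(d[u][v], w)
        d[v][u] = min(d[v][u], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


def stretch(n: int, g_edges: Sequence[Edge], h_edges: Sequence[Edge]) -> float:
    """Largest ratio d_H(u, v) / d_G(u, v); assumes G is connected."""
    dg = shortest_paths(n, g_edges)
    dh = shortest_paths(n, h_edges)
    return max(dh[u][v] / dg[u][v] for u in range(n) for v in range(u + 1, n))


# G is the unit 4-cycle; H drops the edge (3, 0), leaving a path.
cycle = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (3, 0, 1.0)]
path = cycle[:3]
worst = stretch(4, cycle, path)
```

In this example the path is a 3-stretch spanner of the cycle, that is, $1 + \delta$ with $\delta = 2$, because the two endpoints of the dropped edge are now three hops apart.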

A spanner for a metric space $\mathcal{X}$ is defined by thinking of the metric space as a graph $G$ on
the vertex set $\mathcal{X}$, which is the complete graph with edge-lengths corresponding to distances in
$\mathcal{X}$. Doubling metrics are known to admit good spanners [CGMZ05, HM06, GR08]. We will use a
specific variant described in the appendix.



Fat-shattering dimension. We recall the definition of the fat-shattering dimension [ABCH97,
BS99]. A finite set $T \subset \mathcal{X}$ is said to be $\gamma$-shattered by a family of functions $\mathcal{H} \subset \mathbb{R}^{\mathcal{X}}$ if there exists
some function $r \in \mathbb{R}^{T}$ such that for each label assignment $b \in \{-1, 1\}^{T}$ there is an $h \in \mathcal{H}$ satisfying
$b(x)(h(x) - r(x)) \ge \gamma > 0$ for all $x \in T$. The $\gamma$-fat-shattering dimension of $\mathcal{H}$, denoted by $\mathrm{fat}_\gamma(\mathcal{H})$,
is the cardinality of the largest set $\gamma$-shattered by $\mathcal{H}$ (or $\infty$ if the latter is unbounded).
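
For a finite family and a fixed witness function $r$, the shattering condition can be checked by enumerating all sign patterns. The sketch below does exactly that; it is exponential in $|T|$ and meant only to illustrate the definition. The name `is_gamma_shattered` and the toy family are assumptions made for the example.

```python
from itertools import product
from typing import Callable, Dict, Hashable, Sequence


def is_gamma_shattered(T: Sequence[Hashable],
                       functions: Sequence[Callable[[Hashable], float]],
                       r: Dict[Hashable, float],
                       gamma: float) -> bool:
    """True if every labeling b in {-1, +1}^T is realized with margin gamma around r."""
    for b in product((-1, 1), repeat=len(T)):
        realized = any(
            all(bx * (h(x) - r[x]) >= gamma for bx, x in zip(b, T))
            for h in functions
        )
        if not realized:
            return False
    return True


# Toy family: all four {0, 1}-valued functions on the two-point set {0, 1}.
T = [0, 1]
family = [lambda x, v=v: v[x] for v in product((0.0, 1.0), repeat=2)]
witness = {0: 0.5, 1: 0.5}
```

With the witness fixed at $1/2$, this family shatters $T$ at level $\gamma = 0.5$ but at no larger level, since no function gets further than $0.5$ from the witness.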

In order for the random variable $\Delta_n(\mathcal{H})$ defined in (3) to be measurable, $\mathcal{H}$ must satisfy some
mild measure-theoretic conditions (pathologies are known to exist [Dud84]). To avoid the
measure-theoretic technicalities associated with taking suprema over uncountable function classes, all
hypothesis classes $\mathcal{H}$ in this paper are assumed to be admissible, in the sense of [Pol84].



3 Bounds on uniform deviation via fat shattering 

In this section, we derive tail bounds on the uniform deviation $\Delta_n(\mathcal{H})$ defined in (3) in terms of
the smoothness properties of $\mathcal{H}$ and the doubling dimension of the underlying metric space $(\mathcal{X}, \rho)$.

3.1 Preliminaries 



We rely on the powerful framework of fat-shattering dimension developed by [ABCH97], which
requires us to incorporate the value of a hypothesis and the loss it incurs on a sample point into
a single function. This is done by associating to any family of hypotheses $\mathcal{H}$ mapping $\mathcal{X} \to [0, 1]$,
the induced family $\mathcal{F} = \mathcal{F}^q_{\mathcal{H}}$ of functions mapping $\mathcal{X} \times [0, 1] \to [0, 1]$ as follows: for each $h \in \mathcal{H}$ the
corresponding $f = f^q_h \in \mathcal{F}^q_{\mathcal{H}}$ is given by

$$f^q_h(x, y) = |h(x) - y|^q, \qquad q \in \{1, 2\}. \qquad (6)$$

In a slight abuse of notation, we define the uniform deviation of a class $\mathcal{F}$ of $[0, 1]$-valued functions
over $\mathcal{X} \times [0, 1]$:

$$\Delta_n(\mathcal{F}) = \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^{n} f(X_i, Y_i) - \mathbb{E} f(X, Y) \right|; \qquad (7)$$

it is obvious that $\Delta_n(\mathcal{F}^q_{\mathcal{H}}) = \Delta_n(\mathcal{H}, q)$.



3.2 Basic generalization bounds 

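Let us write

$$\mathcal{H}_L = \left\{ h \in [0, 1]^{\mathcal{X}} : \|h\|_{\mathrm{Lip}} \le L \right\} \qquad (8)$$

to denote the collection of $[0, 1]$-valued $L$-Lipschitz functions on $\mathcal{X}$. We proceed to bound the
$\gamma$-fat-shattering dimension of $\mathcal{F}^q_{\mathcal{H}_L}$.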
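
Theorem 3.1. Let $\mathcal{H}_L$ be defined on a metric space $(\mathcal{X}, \rho)$, where $\mathrm{diam}(\mathcal{X}) = 1$. Then

$$\mathrm{fat}_\gamma(\mathcal{F}^q_{\mathcal{H}_L}) \le \left(1 + \frac{1}{\gamma^{(q+1)/2}}\right) \left(\frac{L}{\gamma^{(q+1)/2}}\right)^{\mathrm{ddim}(\mathcal{X})+1}$$

holds for $q \in \{1, 2\}$ and all $0 < \gamma \le \frac{1}{2}$.

Remark. The notation $\mathcal{F}^q_{\mathcal{H}_L}$ is a convenient shorthand for combining the results for $q = 1$ and
$q = 2$ and is not intended to imply an interpolation for intermediate values.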

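Proof: Fix a $\gamma > 0$ and recall what it means for $\mathcal{F}^q_{\mathcal{H}_L}$ to $\gamma$-shatter a set $S = T \times Z$ (where
$T \in \mathcal{X}^{|S|}$ and $Z \in [0, 1]^{|S|}$): there exists some function $r \in \mathbb{R}^{S}$ such that for each label assignment
$b \in \{-1, 1\}^{S}$ there is an $f \in \mathcal{F}^q_{\mathcal{H}_L}$ satisfying $b(s)(f(s) - r(s)) \ge \gamma$ for all $s \in S$.

Put $K = \lceil \gamma^{-(q+1)/2} \rceil$ and define the map $\pi : S \to \{0, 1, \ldots, K\}$ by

$$\pi(s) = \pi(t, z) = \lfloor Kz \rfloor.$$

Thus, we may view $S$ as being partitioned into $K + 1$ buckets:

$$S = \bigcup_{k=0}^{K} \pi^{-1}(k). \qquad (9)$$

Consider two points, $s = (t, z)$ and $s' = (t', z')$, belonging to some fixed bucket $\pi^{-1}(k)$. By
construction, the following hold:

(i) $|z - z'| \le K^{-1} \le \gamma^{(q+1)/2}$;

(ii) since $\mathcal{F}^q_{\mathcal{H}_L}$ $\gamma$-shatters $S$ (and recalling (6)), there is an $h \in \mathcal{H}_L$ satisfying $|h(t) - z|^q \le r - \gamma$
and $|h(t') - z'|^q \ge r' + \gamma$ for some $\gamma \le r \le r' \le 1 - \gamma$.

Conditions (i) and (ii) imply that

$$|h(t) - h(t')| \ge (r' + \gamma)^{1/q} - (r - \gamma)^{1/q} - |z - z'| \ge \gamma^{(q+1)/2}, \qquad (10)$$

where Lemma A.1 is invoked for $q = 2$.

The fact that $h$ is $L$-Lipschitz implies that

$$\rho(t, t') \ge \frac{\gamma^{(q+1)/2}}{L},$$

and by Lemma 2.1 we have

$$|\pi^{-1}(k)| \le \left(\frac{L}{\gamma^{(q+1)/2}}\right)^{\mathrm{ddim}(\mathcal{X})+1} \qquad (11)$$

for each $k \in \{0, 1, \ldots, \lceil \gamma^{-(q+1)/2} \rceil\}$. Together (9) and (11) yield our desired bound on $|S|$, and
hence on the fat-shattering dimension of $\mathcal{F}^q_{\mathcal{H}_L}$. □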

The following generalization bound, implicit in [ABCH97], establishes the learnability of
continuous-valued functions in terms of their fat-shattering dimension.

Theorem 3.2. Let $\mathcal{F}$ be any admissible function class mapping $\mathcal{X} \times [0, 1]$ to $[0, 1]$ and define $\Delta_n(\mathcal{F})$
as in (7). Then for all $0 < \varepsilon < 1$ and all $n \ge 2/\varepsilon^2$,

$$P(\Delta_n(\mathcal{F}) > \varepsilon) \le 24n \left(\frac{288n}{\varepsilon^2}\right)^{d \log(24en/\varepsilon)} \exp(-\varepsilon^2 n/36)$$

where $d = \mathrm{fat}_{\varepsilon/24}(\mathcal{F})$.

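Corollary 3.3. Fix an $1 > \varepsilon > 0$ and $q \in \{1, 2\}$. Let $\mathcal{H}_L$ be defined on a metric space $(\mathcal{X}, \rho)$ and
recall the definition of $\Delta_n(\mathcal{H}_L, q)$ in (3). Then for all $n \ge 2/\varepsilon^2$,

$$P(\Delta_n(\mathcal{H}_L, q) > \varepsilon) \le 24n \left(\frac{288n}{\varepsilon^2}\right)^{d \log(24en/\varepsilon)} \exp(-\varepsilon^2 n/36) \qquad (12)$$

where

$$d = \left(1 + \frac{1}{(\varepsilon/24)^{(q+1)/2}}\right) \left(\frac{L}{(\varepsilon/24)^{(q+1)/2}}\right)^{\mathrm{ddim}(\mathcal{X})+1}.$$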
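
To get a feel for the scale of the tail bound, the sketch below evaluates the logarithm of a bound of the form $24n\,(288n/\varepsilon^2)^{d\log(24en/\varepsilon)}\exp(-\varepsilon^2 n/36)$ with $d = (1 + g^{-1})(L/g)^{\mathrm{ddim}+1}$ and $g = (\varepsilon/24)^{(q+1)/2}$. This is an assumed reading of the corollary (the constants are taken as stated there, and natural logarithms are assumed); working in log space avoids overflow. The function names are hypothetical.

```python
import math


def fat_bound(eps: float, L: float, ddim: float, q: int) -> float:
    """d = (1 + 1/g) * (L/g)^(ddim + 1) with g = (eps/24)^((q+1)/2)."""
    g = (eps / 24.0) ** ((q + 1) / 2.0)
    return (1.0 + 1.0 / g) * (L / g) ** (ddim + 1.0)


def log_tail_bound(n: int, eps: float, L: float, ddim: float, q: int) -> float:
    """Natural log of 24n * (288n/eps^2)^(d log(24en/eps)) * exp(-eps^2 n / 36)."""
    d = fat_bound(eps, L, ddim, q)
    return (math.log(24.0 * n)
            + d * math.log(24.0 * math.e * n / eps) * math.log(288.0 * n / eps ** 2)
            - eps ** 2 * n / 36.0)


# The bound is informative (below 1) only once its logarithm is negative.
small_n = log_tail_bound(10 ** 6, eps=0.24, L=1.0, ddim=1.0, q=1)
large_n = log_tail_bound(10 ** 13, eps=0.24, L=1.0, ddim=1.0, q=1)
```

Even for $L = 1$ and doubling dimension 1, the bound only becomes non-vacuous at very large $n$ for this $\varepsilon$, which reflects the looseness of worst-case constants rather than typical behavior.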



We can conclude from Corollary 3.3 that there exists $\varepsilon(n, L, \delta)$ such that with probability at
least $1 - \delta$,

$$\Delta_n(\mathcal{H}_L, q) \le \varepsilon(n, L, \delta), \qquad (13)$$

and by essentially inverting Equation (12), we have

/ ( h , /TN /rdriimfA-Wl \ l/(2+2±i(ddim(^) + l))^ \ 

,(„,i,i)<oL„U».(^^^Vn) . (14) 

(For simplicity, the dependence of $\varepsilon(\cdot)$ on $\mathrm{ddim}(\mathcal{X})$ is suppressed.) This implies via (4) that

$$R(h) \le R_n(h) + \varepsilon(n, L, \delta)$$

uniformly for all $h \in \mathcal{H}_L$ with high probability.

3.3 Generalization for approximate hypotheses 

Since the actual regression functions we compute in Section 4 are additive approximations to smooth
functions, but not necessarily smooth themselves, we will need some machinery for handling these.
To this end, let us write

$$[\mathcal{H}]_\eta = \left\{ h' : \exists h \in \mathcal{H} \text{ s.t. } \|h - h'\|_\infty \le \eta \right\} \qquad (15)$$

to denote all $\eta$-perturbations of some function class $\mathcal{H} \subset \mathbb{R}^{\mathcal{X}}$ for $\eta > 0$.



Our next objective is an analogue of Theorem 3.1 for additively perturbed Lipschitz functions. 



Lemma 3.4. Let $\mathcal{G}$ be a collection of real-valued functions defined over some set $V$ and let $[\mathcal{G}]_\eta$ be
the $\eta$-perturbation of $\mathcal{G}$, as defined in (15). Then

$$\mathrm{fat}_\gamma([\mathcal{G}]_\eta) \le \mathrm{fat}_{\gamma - \eta}(\mathcal{G})$$

holds for all $\gamma > \eta$.

Proof: Suppose that $[\mathcal{G}]_\eta$ shatters the set $S = \{x_1, \ldots, x_d\} \subset V$ at level $\gamma$. Then there is some
$r \in \mathbb{R}^{S}$ so that for all $b \in \{-1, 1\}^{S}$ there is an $f'_b \in [\mathcal{G}]_\eta$ such that

$$b(x)(f'_b(x) - r(x)) \ge \gamma \qquad (16)$$

for all $x \in S$. Now by definition, for each $f' \in [\mathcal{G}]_\eta$ there is some $f \in \mathcal{G}$ so that $\sup_{x \in V} |f(x) - f'(x)| \le \eta$.
Define $f_b \in \mathcal{G}$ to be such an $\eta$-approximation for each $f'_b$. We claim that the collection
$\{f_b : b \in \{-1, 1\}^{S}\}$ shatters $S$ at level $\gamma - \eta$. Indeed, replacing $f'_b(x)$ with $f_b(x)$ in (16) perturbs
the left-hand side by an additive term of at most $\eta$. □



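Corollary 3.5. Let $\mathcal{H}_L$ be defined on a metric space $(\mathcal{X}, \rho)$ and $[\mathcal{H}_L]_\eta$ be the $\eta$-perturbation of $\mathcal{H}_L$
for $\eta > 0$, as defined in (15). Further, let $\mathcal{F}^q_{[\mathcal{H}_L]_\eta}$ be the induced family of functions $f : \mathcal{X} \times [0, 1] \to [0, 1]$
as defined in (6). Then, for $q \in \{1, 2\}$ and $\gamma \in (q\eta, \frac{1}{2}]$, we have

$$\mathrm{fat}_\gamma(\mathcal{F}^q_{[\mathcal{H}_L]_\eta}) \le \left(1 + \frac{1}{(\gamma - q\eta)^{(q+1)/2}}\right) \left(\frac{L}{(\gamma - q\eta)^{(q+1)/2}}\right)^{\mathrm{ddim}(\mathcal{X})+1}.$$

Proof: Lemma A.2 shows that perturbing an $h \in \mathcal{H}_L$ by an additive $\eta$ perturbs the corresponding
$f \in \mathcal{F}^q_{\mathcal{H}_L}$ by an additive $q\eta$. Lemma 3.4 relates the fat-shattering dimension of an $\eta$-perturbed
function class to that of its unperturbed version. Finally, Theorem 3.1 gives an estimate on the
fat-shattering dimension for the case where the unperturbed function class consists of the $L$-Lipschitz
functions on the doubling metric space $(\mathcal{X}, \rho)$. □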

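We are now able to extend Corollary 3.3 to perturbations of Lipschitz functions.

Theorem 3.6. Fix an $\varepsilon > 0$ and $q \in \{1, 2\}$. Let $\mathcal{H}_L$ be defined on a metric space $(\mathcal{X}, \rho)$ and $[\mathcal{H}_L]_\eta$
be the $\eta$-perturbation of $\mathcal{H}_L$ for $0 < \eta < \varepsilon/24q$. Then for all $n \ge 2/\varepsilon^2$,

$$P(\Delta_n([\mathcal{H}_L]_\eta, q) > \varepsilon) \le 24n \left(\frac{288n}{\varepsilon^2}\right)^{d \log(24en/\varepsilon)} \exp(-\varepsilon^2 n/36) \qquad (17)$$

where $\Delta_n(\mathcal{H}, q)$ is the uniform deviation defined in (3) and

$$d = d(L, \eta) = \left(1 + \frac{1}{(\varepsilon/24 - q\eta)^{(q+1)/2}}\right) \left(\frac{L}{(\varepsilon/24 - q\eta)^{(q+1)/2}}\right)^{\mathrm{ddim}(\mathcal{X})+1}.$$

Inverting the relation in (17) we get an estimate analogous to (13): with probability at least
$1 - \delta$,

$$\Delta_n([\mathcal{H}_L]_\eta, q) \le \varepsilon(n, L, \delta) + 24q\eta,$$

where $\varepsilon(\cdot)$ is as in (14), implying a risk bound of

$$R(h) \le R_n(h) + \varepsilon(n, L, \delta) + 24q\eta.$$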

3.4 Simultaneous bounds for multiple Lipschitz constants

So far, we have established the following. Let $(\mathcal{X}, \rho)$ be a doubling metric space and $\mathcal{H}_L$ a collection
of $L$-Lipschitz $[0, 1]$-valued functions on $\mathcal{X}$. Then Corollary 3.3 guarantees that for all $\varepsilon, \delta \in (0, 1)$
and $n \ge n_0(\varepsilon, \delta, L, \mathrm{ddim}(\mathcal{X}))$, we have

$$P(\Delta_n(\mathcal{H}_L) > \varepsilon) \le \delta, \qquad (18)$$

where $\Delta_n(\mathcal{H}_L)$ is the uniform deviation defined in (3). Theorem 3.6 extends this to perturbed
Lipschitz functions.

Since our computational approach in Section 4 requires optimizing over Lipschitz constants, we
will need a bound such as (18) that holds for many function classes of varying smoothness
simultaneously. This is easily accomplished by stratifying the confidence parameter $\delta$, as in [SBWA98].
We will need the following theorem:

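Theorem 3.7. Let

$$\mathcal{H}^{(1)} \subset \mathcal{H}^{(2)} \subset \cdots$$

be a sequence of function classes taking $\mathcal{X}$ to $[0, 1]$ and let $p_k \in [0, 1]$, $k = 1, 2, \ldots$, be a sequence
summing to 1. Suppose that $\varepsilon : \mathbb{N} \times \mathbb{N} \times (0, 1) \to [0, 1]$ is a function such that for each $k \in \mathbb{N}$, with
probability at least $1 - \delta$, we have

$$\Delta_n(\mathcal{H}^{(k)}, q) \le \varepsilon(n, k, \delta).$$

Then, whenever some $h \in \bigcup_{k \in \mathbb{N}} [\mathcal{H}^{(k)}]_\eta$ achieves empirical risk $R_n(h)$ on a sample of size $n$, we
have that with probability at least $1 - \delta$,

$$R(h) \le R_n(h) + \varepsilon(n, k, \delta p_k) \quad \forall k. \qquad (19)$$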

Proof: An immediate consequence of the union bound. □ 



The structural risk minimization principle implied by Theorem 3.7 amounts to the following
model selection criterion: choose an $h \in \mathcal{H}^{(k)}$ for which the right-hand side of (19) is minimized.

In applying Theorem 3.7 to Lipschitz classifiers in Section 4 below, we impose a discretization on
the Lipschitz constant $L$ to be multiples of $\eta$. Formally, we consider the stratification $\mathcal{H}^{(k)} = \mathcal{H}_{L_k}$,

$$\mathcal{H}_{L_1} \subset \mathcal{H}_{L_2} \subset \cdots,$$

where $L_k = k\eta$ with corresponding $p_k = 2^{-k}$ for $k = 1, 2, \ldots$. This means that whenever we need a
hypothesis that is an $L$-Lipschitz regression function, we may take $k = \lceil L/\eta \rceil$ and use $\varepsilon(n, k, \delta 2^{-k})$
as the generalization error bound. Note that all possible values of $L$ are within a factor of 2 of the
discretized sequence $L_k$.
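
The stratification can be sketched in a few lines: round the Lipschitz constant up to a multiple of $\eta$, and charge that stratum a $2^{-k}$ share of the confidence budget. The helper names below are hypothetical, and clamping $k$ to at least 1 (so that $L = 0$ falls in the first stratum) is an assumption made for the example.

```python
import math
from typing import Tuple


def stratum(L: float, eta: float) -> Tuple[int, float, float]:
    """Return (k, L_k, p_k) with L_k = k * eta >= L and p_k = 2^(-k)."""
    if eta <= 0:
        raise ValueError("eta must be positive")
    k = max(1, math.ceil(L / eta))  # assumption: L = 0 is assigned to k = 1
    return k, k * eta, 2.0 ** (-k)


def stratified_confidence(L: float, eta: float, delta: float) -> float:
    """Confidence parameter delta * p_k to plug into the per-class bound."""
    _, _, p_k = stratum(L, eta)
    return delta * p_k


k, L_k, p_k = stratum(0.7, 0.25)
```

Because the weights $2^{-k}$ sum to 1 over $k \ge 1$, the union bound over all strata costs no more than the original $\delta$; the price is that rougher hypotheses (larger $k$) are held to an exponentially stricter confidence level.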



4 Structural risk minimization 

In this section, we address the problem of efficient model selection when given n observed samples. 
The algorithm described below computes a hypothesis that approximately attains the minimum 
risk over all hypotheses. 

Recall the risk bound achieved as a consequence of Theorems 3.6 and 3.7. Whenever some
$h \in \bigcup_{k \in \mathbb{N}} [\mathcal{H}^{(k)}]_\eta$ achieves empirical risk $R_n(h)$ on a sample of size $n$, we have the following bound
on $R(h)$, the true risk of $h$:

$$R(h) \le R_n(h) + \varepsilon(n, k, \delta p_k) + 24q\eta, \qquad (20)$$

with probability at least $1 - \delta$ (where the diameter of the point set has been taken as 1, and
$\varepsilon(n, k, \delta p_k) \ge \sqrt{2/n}$ is the minimum value of $\varepsilon$ for which the right-hand side of Equation (12) is
at most $\delta$). In the rest of this section, we devise an algorithm that computes a hypothesis that
approximately minimizes our bound from (20) on the true risk, denoted henceforth

$$R_\eta(h) = R_n(h) + \varepsilon(n, k, \delta p_k) + 24q\eta.$$

Notice that on the right-hand side, the first two terms depend on $L$, but only the first term depends
on the choice of $h$, and only the third term depends on $\eta$.

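Theorem 4.1. Let $(X_i, Y_i)$ for $i = 1, \ldots, n$ be an i.i.d. sample drawn from $\mu$, let $\eta \in (0, \ldots)$, and
let $h^*$ be a hypothesis that minimizes $R_\eta(h)$ over all $h \in \bigcup_{k \in \mathbb{N}} [\mathcal{H}^{(k)}]_\eta$. There is an algorithm
that, given the $n$ samples and $\eta$ as input, computes in time $\eta^{-O(\mathrm{ddim}(\mathcal{X}))} n \log^3 n$ a hypothesis
$h' \in \bigcup_{k \in \mathbb{N}} [\mathcal{H}^{(k)}]_\eta$ with

$$R_\eta(h') \le 2R_\eta(h^*). \qquad (21)$$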



Remark. We show in Theorem 5.1 how to quickly evaluate the hypothesis h' on new points. 

In proving the theorem, we will find it convenient to compare the output $h'$ to a hypothesis
$\bar{h}$ that is smooth (i.e. Lipschitz but unperturbed). Indeed, let $h^*$ be as in the theorem, and
$\bar{h} \in \bigcup_{k \in \mathbb{N}} \mathcal{H}^{(k)}$ a hypothesis that minimizes $R_\eta(h)$. Then $R_n(h^*) \le R_n(\bar{h}) \le R_n(h^*) + \eta$,
and we get $R_\eta(h^*) \le R_\eta(\bar{h}) \le R_\eta(h^*) + \eta$. Accordingly, the analysis below will actually prove
that $R_\eta(h') \le 2R_\eta(\bar{h}) - 2\eta$, and then (21) would follow easily, essentially increasing the additive
error by $2\eta$. Moreover, once Equation (21) is proved, we can use the above to conclude that
$R_\eta(h') \le 2R_0(\bar{h}) + O(\eta)$, which compares the risk bound of our algorithm's output $h'$ to what we
could possibly get using smooth hypotheses.

In the rest of this section we consider the $n$ observed samples as fixed values, given as input to
the algorithm, so we will write $x_i$ instead of $X_i$.



4.1 Motivation and construction 

Suppose that the Lipschitz constant of an optimal unperturbed hypothesis $\bar{h}$ were known to be
$L = \bar{L}$. Then $\varepsilon(n, k, \delta p_k)$ is fixed, and the problem of computing both $\bar{h}$ and its empirical risk
$R_n(\bar{h})$ can be described as the following optimization program with variables $f(x_i)$ for $i \in [n]$ to
represent the assignments $\bar{h}(x_i)$. Note it is a Linear Program (LP) when $q = 1$ and a quadratic
program when $q = 2$.

$$\begin{array}{lll}
\text{Minimize} & \sum_{i \in [n]} |y_i - f(x_i)|^q & \\
\text{subject to} & |f(x_i) - f(x_j)| \le L \cdot \rho(x_i, x_j) & \forall i, j \in [n] \\
 & 0 \le f(x_i) \le 1 & \forall i \in [n]
\end{array} \qquad (22)$$
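
The structure of this program can be made concrete without a solver: the sketch below evaluates the objective and checks the box and pairwise Lipschitz constraints for a candidate assignment. It is a feasibility checker only, not the fast solver developed in the paper, and the function names and toy data are assumptions made for illustration.

```python
from typing import Callable, Sequence, TypeVar

P = TypeVar("P")


def objective(f: Sequence[float], ys: Sequence[float], q: int) -> float:
    """Sum over the sample of |y_i - f(x_i)|^q."""
    return sum(abs(y - v) ** q for v, y in zip(f, ys))


def is_feasible(f: Sequence[float], xs: Sequence[P],
                dist: Callable[[P, P], float], L: float,
                tol: float = 1e-12) -> bool:
    """Check 0 <= f(x_i) <= 1 and |f(x_i) - f(x_j)| <= L * dist(x_i, x_j)."""
    n = len(f)
    if any(v < -tol or v > 1.0 + tol for v in f):
        return False
    return all(abs(f[i] - f[j]) <= L * dist(xs[i], xs[j]) + tol
               for i in range(n) for j in range(i + 1, n))


# Three points on the line with a spike in the labels; target constant L = 1.
xs = [0.0, 0.5, 1.0]
ys = [0.0, 1.0, 0.0]
line = lambda a, b: abs(a - b)
raw_ok = is_feasible(ys, xs, line, L=1.0)                    # labels as-is
clipped_ok = is_feasible([0.0, 0.5, 0.0], xs, line, L=1.0)   # spike halved
```

Fitting the labels exactly violates the Lipschitz constraint here, so any feasible assignment must pay some empirical loss; the number of pairwise constraints grows quadratically with the sample size, which motivates the constraint reduction later in this section.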

It follows that $\bar{h}$ could be computed by first deriving $\bar{L}$, and then solving the above program.
However, it seems that computing these exactly is an expensive computation. This motivates our
search for an approximate solution to risk minimization. We first derive a target Lipschitz constant
$L'$ that "approximates" $\bar{L}$, in the sense that it minimizes the objective $\max\{R_n(h'), \varepsilon(n, k, \delta p_k)\}$.
Notice that $R_n(h')$ may be computed by solving LP (22) using the given value $L'$ for $L$. We wish
to find such $L'$ via a binary search procedure, which requires a method to determine whether a
candidate $L$ satisfies $L \le L'$, but since our objective need not be a monotone function of $L$, we
cannot rely on the value of the objective at the candidate $L$. Instead, recall that the empirical risk
term $R_n(h')$ is monotonically non-increasing, and the penalty term $\varepsilon(n, k, \delta p_k)$ is monotonically
non-decreasing, and therefore we can take $L'$ to be the minimum value $L$ for which $R_n(h') \le \varepsilon(n, k, \delta p_k)$
(notice that both terms are right-continuous in $L$). Our binary search procedure can thus determine
whether a candidate $L$ satisfies $L \le L'$ by checking instead whether $R_n(h') \le \varepsilon(n, k, \delta p_k)$.
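
The search described above can be sketched over discretized candidates $L = k\eta$: given a non-increasing risk term and a non-decreasing penalty term, binary search finds the smallest index at which the risk drops to or below the penalty. The callables `risk_at` and `penalty_at` are hypothetical stand-ins for solving the program at a given $L$ and for evaluating $\varepsilon(n, k, \delta p_k)$; the toy curves are invented for the example.

```python
from typing import Callable, Optional


def smallest_balanced_index(risk_at: Callable[[int], float],
                            penalty_at: Callable[[int], float],
                            k_max: int) -> Optional[int]:
    """Smallest k in [0, k_max] with risk_at(k) <= penalty_at(k), or None.

    Assumes risk_at is non-increasing and penalty_at is non-decreasing in k,
    so the predicate is monotone and binary search applies.
    """
    if risk_at(k_max) > penalty_at(k_max):
        return None
    lo, hi = 0, k_max
    while lo < hi:
        mid = (lo + hi) // 2
        if risk_at(mid) <= penalty_at(mid):
            hi = mid          # mid already balances; look for a smaller index
        else:
            lo = mid + 1      # risk still dominates; a larger constant is needed
    return lo


# Toy monotone curves standing in for the two terms.
toy_risk = lambda k: 1.0 / (1 + k)
toy_penalty = lambda k: 0.1 * k
k_star = smallest_balanced_index(toy_risk, toy_penalty, k_max=100)
```

Each probe of `risk_at` corresponds to one solve of the program, so the logarithmic number of probes is what keeps the overall search cheap.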

Were the binary search on $L$ to be carried out indefinitely (that is, with infinite precision), it
would yield $L'$ and a smooth hypothesis $h'$ satisfying $R_\eta(h') \le 2R_\eta(\bar{h})$, where the factor 2 originates
from the gap between maximum and summation. The formal proof actually gives a slightly stronger
bound:

$$R_\eta(h') - 24q\eta \le 2\max\{R_n(h'), \varepsilon(n, k, \delta p_k)\} \le 2(R_n(\bar{h}) + \varepsilon(n, k, \delta p_k)) \le 2(R_\eta(\bar{h}) - 24q\eta).$$

(In our actual LP solver below, $h'$ will not be necessarily smooth, but rather a perturbation of a
smooth hypothesis.) However, to obtain a tractable runtime, we fix an additive precision of $\eta$ to
the Lipschitz constant, and restrict the target Lipschitz constant to be a multiple of $\eta$. Notice that
$R_\eta(\bar{h}) \le 2$ for sufficiently large $n$ (since this bound can even be achieved by a hypothesis with
Lipschitz constant 0), so by Equation (12) it must be that $\bar{L} \le n^{O(1)}$. It follows that the binary
search will consider only $O(\log(n/\eta))$ candidate values for $L'$.

To bound the effect of discretizing the target $L'$ to multiples of $\eta$, we shall show the existence of a
hypothesis $\tilde{h}$ that has Lipschitz constant $\tilde{L} \le \max\{\bar{L} - \eta, 0\}$ and satisfies $R_\eta(\tilde{h}) \le R_\eta(\bar{h}) + \eta$. To see
this, assume by translation that the minimum and maximum values assigned by $\bar{h}$ are, respectively,
0 and $a \le 1$. Thus, its Lipschitz constant is $\bar{L} \ge a$ (recall we normalized $\mathrm{diam}(\mathcal{X}) = 1$). Assuming
first the case $a \ge \eta$, we can set $\tilde{h}(x) = (1 - \frac{\eta}{a}) \cdot \bar{h}(x)$, and it is easy to verify that its Lipschitz
constant is at most $(1 - \frac{\eta}{a})\bar{L} \le \bar{L} - \eta$, and $R_\eta(\tilde{h}) \le R_\eta(\bar{h}) + \eta$. The case $a < \eta$ is even easier, as now
there is trivially a function $\tilde{h}$ with Lipschitz constant 0 and $R_\eta(\tilde{h}) \le R_\eta(\bar{h}) + \eta$. It follows that
when the binary search is analyzed using this $\tilde{h}$ instead of $\bar{h}$, we actually get

$$R_\eta(h') \le 2R_\eta(\tilde{h}) - 24q\eta \le 2R_\eta(\bar{h}) - 22q\eta \le 2R_\eta(h^*) - 20q\eta.$$
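
The scaling step in this argument is easy to check numerically. The sketch below shrinks a function whose minimum value is 0 and maximum value is $a$ by the factor $1 - \eta/a$ (or replaces it by the zero function when $a < \eta$), so that one can confirm the Lipschitz constant drops by at least $\eta$ while no value moves by more than $\eta$. The helper names and the three-point example on the real line are assumptions made for illustration.

```python
from itertools import combinations
from typing import List, Sequence


def lipschitz(xs: Sequence[float], vals: Sequence[float]) -> float:
    """Largest slope over all pairs of distinct points on the real line."""
    return max(abs(vals[i] - vals[j]) / abs(xs[i] - xs[j])
               for i, j in combinations(range(len(xs)), 2))


def shrink(vals: Sequence[float], eta: float) -> List[float]:
    """Scale by (1 - eta/a), where a = max value; assumes min value is 0."""
    a = max(vals)
    if a < eta:
        return [0.0] * len(vals)   # constant function, Lipschitz constant 0
    return [(1.0 - eta / a) * v for v in vals]


xs = [0.0, 0.5, 1.0]          # diameter 1, as normalized in the text
vals = [0.0, 0.3, 0.8]        # minimum 0, maximum a = 0.8
eta = 0.2
new_vals = shrink(vals, eta)
old_L, new_L = lipschitz(xs, vals), lipschitz(xs, new_vals)
max_shift = max(abs(u - v) for u, v in zip(vals, new_vals))
```

The reduction in the Lipschitz constant relies on $\bar{L} \ge a$, which in turn uses the normalization of the diameter to 1.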



It now remains to show that given $L'$, program (22) may be solved quickly (within certain
accuracy), which we do in the sections that follow.



4.2 Solving the linear program 

We show how to solve the linear program, given target Lipschitz constant L'. 

Fast LP-solver framework. To solve the linear program, we utilize the framework presented
by Young [You01] for LPs of the following form: Given non-negative matrices $P, C$, vectors $p, c$ and
precision $\beta > 0$, find a non-negative vector $x$ such that $Px \le p$ and $Cx \ge c$. Young shows that if
there exists a feasible solution to the input instance, then a solution to a relaxation of the input
program (specifically, $Px \le (1 + \beta)p$ and $Cx \ge c$) can be found in time $O(md(\log m)/\beta^2)$, where $m$
is the number of constraints in the program and $d$ is the maximum number of constraints in which
a single variable may appear.

In utilizing this framework for our problem, we encounter a difficulty that both the input
matrices and output vector must be non-negative, while our LP (22) has difference constraints.
To bypass this limitation, for each LP variable $f(x_i)$ we introduce a new variable $\tilde{x}_i$ and two new
constraints:

$$f(x_i) + \tilde{x}_i \le 1$$
$$f(x_i) + \tilde{x}_i \ge 1$$

By the guarantees of the LP solver, we have that in the returned solution $1 - f(x_i) \le \tilde{x}_i \le
1 - f(x_i) + \beta$ and $\tilde{x}_i \ge 0$. This technique allows us to introduce negated variables $-f(x_i)$ into the
linear program, at the loss of additive precision.



Reduced constraints. A central difficulty in obtaining a near-linear runtime for the above linear
program is that the number of constraints in LP (22) is $\Theta(n^2)$. We show how to reduce the number
of constraints to near-linear in $n$, namely, $\eta^{-O(\mathrm{ddim}(\mathcal{X}))} n$. We will further guarantee that each of
the $n$ variables $f(x_i)$ appears in only $\eta^{-O(\mathrm{ddim}(\mathcal{X}))}$ constraints. Both these properties will prove
useful for solving the program quickly.

Recall that the purpose of the $\Theta(n^2)$ constraints is solely to ensure that the target Lipschitz
constant is not violated between any pair of points. We will show below that this property can be
approximately maintained with many fewer constraints: The spanner described in the appendix has
stretch $1 + \delta$, degree $\delta^{-O(\mathrm{ddim}(\mathcal{X}))}$ and hop-diameter $c' \log n$ for some constant $c' > 0$, and can be
computed quickly. Build this spanner for the observed sample points $\{x_i : i \in [n]\}$ with stretch
$1 + \eta$ (i.e., set $\delta = \eta$) and retain a constraint in LP (22) if and only if its two variables correspond
to two nodes that are connected in the spanner. It follows from the bounded degree of the spanner
that each variable appears in $\eta^{-O(\mathrm{ddim}(\mathcal{X}))}$ constraints, which implies that there are $\eta^{-O(\mathrm{ddim}(\mathcal{X}))} n$
total constraints.



Modifying remaining constraints. Each spanner-edge constraint $|f(x_i) - f(x_j)| \le L' \cdot \rho(x_i, x_j)$
is replaced by a set of two constraints

$$f(x_i) + \tilde{x}_j \le 1 + L' \cdot \rho(x_i, x_j)$$
$$f(x_j) + \tilde{x}_i \le 1 + L' \cdot \rho(x_i, x_j)$$

By the guarantees of the LP solver we have that in the returned solution, each spanner edge
constraint will satisfy

$$|f(x_i) - f(x_j)| \le -1 + (1 + \beta)[1 + L' \cdot \rho(x_i, x_j)] = \beta + (1 + \beta) L' \cdot \rho(x_i, x_j).$$

Now consider the Lipschitz condition for two points not connected by a spanner edge: Let
$x_1, \ldots, x_{k+1}$ be a $(1 + \eta)$-stretch ($k \le c' \log n$)-hop spanner path connecting points $x = x_1$ and
$x' = x_{k+1}$. Then the spanner stretch guarantees that

$$|f(x) - f(x')| \le \sum_{i=1}^{k} [\beta + (1 + \beta) L' \cdot \rho(x_i, x_{i+1})] \le \beta c' \log n + (1 + \beta) L' \cdot (1 + \eta) \rho(x, x').$$

Choosing $\beta = \frac{\eta^2}{24 q c' \log n}$, and noting that $(1 + \beta)(1 + \eta) \le (1 + 2\eta)$, we have that for all point
pairs

$$|f(x) - f(x')| \le \frac{\eta^2}{24q} + (1 + 2\eta) L' \cdot \rho(x, x').$$



We claim that the above inequality ensures that the computed hypothesis $h'$ (represented by variables $f(x_i)$ above) is a $6\eta$-perturbation of some hypothesis with Lipschitz constant $L'$. To prove this, first note that if $L' = 0$, then the statement follows trivially. Assume then that (by the discretization of $L'$) $L' \ge \eta$. Now note that a hypothesis with Lipschitz constant $(1 + 3\eta)L'$ is a $3\eta$-perturbation of some hypothesis with Lipschitz constant $L'$. (This follows easily by scaling down this hypothesis by a factor of $(1 + 3\eta)$, and recalling that all values are in the range $[0, 1]$.) Hence, it suffices to show that the computed hypothesis $h'$ is a $3\eta$-perturbation of some hypothesis $\tilde{h}$ with Lipschitz constant $(1 + 3\eta)L'$. We can construct $\tilde{h}$ as follows: Extract from the sample points $S = \{x_i : i \in [n]\}$ an $(\eta/L')$-net $N$,¹ then for every net-point $z \in N$ set $\tilde{h}(z) = h'(z)$, and extend this function $\tilde{h}$ from $N$ to all of $S$ without increasing the Lipschitz constant by using the McShane-Whitney extension theorem [McS34, Whi34] for real-valued functions. Observe that for every two net-points $z \ne z' \in N$,

$$|\tilde{h}(z) - \tilde{h}(z')| \le \frac{\eta^2}{24q} + (1 + 2\eta) L' \cdot \rho(z, z') \le (1 + 3\eta) L' \cdot \rho(z, z').$$

It follows that $\tilde{h}$ (defined on all of $S$) has Lipschitz constant at most $(1 + 3\eta)L'$. Now, consider any point $x \in S$ and its closest net-point $z \in N$; then $\rho(x, z) \le \eta/L'$. Using the fact $\tilde{h}(z) = h'(z)$, we have that

$$|h'(x) - \tilde{h}(x)| \le |h'(x) - h'(z)| + |\tilde{h}(z) - \tilde{h}(x)| \le \left[\frac{\eta^2}{24q} + (1 + 2\eta) L' \cdot \rho(x, z)\right] + (1 + 3\eta) L' \cdot \rho(x, z) \le \frac{\eta^2}{24q} + 2\eta + 5\eta^2 \le 3\eta.$$

We conclude that $h'$ is a $3\eta$-perturbation of $\tilde{h}$, and a $6\eta$-perturbation of some hypothesis with Lipschitz constant $L'$.
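The net-and-extend construction used in this argument can be sketched as follows. This is a minimal illustration under stated assumptions: the metric is the absolute difference on the unit interval, the hypothesis `h_prime` is a hypothetical stand-in with Lipschitz constant 1.5, and the extension uses the standard McShane formula $\tilde{h}(x) = \min_z [\tilde{h}(z) + L \rho(x, z)]$. None of the names below come from the paper.

```python
import math
import random


def greedy_net(points, dist, r):
    """Greedy r-net: net points are pairwise >= r apart and every point is within r of the net."""
    net = []
    for p in points:
        if all(dist(p, z) >= r for z in net):
            net.append(p)
    return net


def mcshane_extension(net, values, dist, lipschitz):
    """Return the function x -> min over net points z of values[z] + lipschitz * dist(x, z)."""
    def h(x):
        return min(v + lipschitz * dist(x, z) for z, v in zip(net, values))
    return h


random.seed(1)
eta, L = 0.1, 1.5
sample = [random.random() for _ in range(200)]
dist = lambda a, b: abs(a - b)
# Hypothetical stand-in for the computed hypothesis; its Lipschitz constant is 1.5.
h_prime = lambda x: 0.5 + 0.5 * math.sin(3 * x)

net = greedy_net(sample, dist, eta / L)
h_tilde = mcshane_extension(net, [h_prime(z) for z in net], dist, L)
worst = max(abs(h_prime(x) - h_tilde(x)) for x in sample)
print(len(net), "net points; largest deviation on the sample:", worst)
```

Because every sample point lies within $\eta/L$ of a net point and both functions are $L$-Lipschitz in this toy setting, the deviation on the sample is at most $2\eta$, mirroring the triangle-inequality step in the proof.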



Objective function. We now turn to the objective function $\frac{1}{n} \sum_i |y_i - f(x_i)|$. We use the same technique as above for handling difference constraints: For each term $|y_i - f(x_i)|$ in the objective function we introduce the variable $w_i$ and the constraint

$$f(x_i) + w_i \ge y_i$$

Note that the solver imposes the constraint that $w_i \ge 0$, so we have that $w_i \ge \max\{0, y_i - f(x_i)\}$. Now consider the term $f(x_i) + 2w_i$, and note that the minimum feasible value of this term in the solution of the linear program is exactly equal to $y_i + |y_i - f(x_i)|$: If $f(x_i) \ge y_i$ then the minimum feasible value of $w_i$ is 0, which yields $f(x_i) + 2w_i = f(x_i) = y_i + (f(x_i) - y_i) = y_i + |y_i - f(x_i)|$. Otherwise we have that $f(x_i) < y_i$, so the minimum feasible value of $w_i$ is $y_i - f(x_i)$, which yields $f(x_i) + 2w_i = 2y_i - f(x_i) = y_i + |y_i - f(x_i)|$.
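The two-case identity above is easy to check mechanically. The sketch below uses exact rational arithmetic and a hypothetical helper `min_objective_term`, which simply plugs in the smallest $w_i$ allowed by the two constraints $w_i \ge 0$ and $f(x_i) + w_i \ge y_i$.

```python
from fractions import Fraction


def min_objective_term(f, y):
    """Smallest value of f + 2w subject to w >= 0 and f + w >= y."""
    w = max(Fraction(0), y - f)
    return f + 2 * w


grid = [Fraction(k, 10) for k in range(11)]
for f in grid:
    for y in grid:
        # Both cases of the argument: f >= y (w = 0) and f < y (w = y - f).
        assert min_objective_term(f, y) == y + abs(y - f)
print("f + 2w matches y + |y - f| on the whole grid")
```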

The objective function is then replaced by the constraint

$$\frac{1}{n} \sum_i \left(f(x_i) + 2w_i\right) \le r,$$

which by the above discussion is equal to $\frac{1}{n} \sum_i \left(y_i + |y_i - f(x_i)|\right) \le r$, and hence is a direct bound on the empirical error of the hypothesis. We choose the bound $r$ via binary search: Recalling that $R_n(h') \le 1$ (since even a hypothesis with Lipschitz constant 0 can achieve this bound), we may set $r \le 1$. By discretizing $r$ in multiples of $\eta$ (similar to what was done for $L'$), we have that the binary search will consider only $O(\log \eta^{-1})$ guesses for $r$. Note that for guess $r'$, the solver guarantees only that the returned sum is less than $(1 + \beta) r' \le r' + \beta < r' + \eta$. It follows that the discretization of $r$ and its solver relaxation introduce, together, at most an additive error of $2\eta$ in the LP objective, i.e., in $R_n(h')$ and in $R_\eta(h')$.

¹The notion of a net referred to here is standard in metric spaces, and means that (i) the distance between every two points in $N$ is at least $\eta/L'$; and (ii) every point in $S$ is within distance $\eta/L'$ from at least one point in $N$. It can be easily constructed by a greedy process.



Correctness and runtime analysis. The fast LP solver ensures that $h'$ computed by the above-described algorithm is a $6\eta$-perturbation of a hypothesis with Lipschitz constant $L'$. As for $R(h')$,
which we wanted to minimize, an additive error of 2r/ is incurred by comparing h' to h instead 
of to h* , another additive error of 2r] arises from discretizing L into V (i.e., comparing to h 
instead of /i), and another additive error 4r/ introduced through the discretization of r and its 
solver relaxation. Overall, the algorithm above computes a hypothesis h' G UfegN [^^'^^ler; '^i^h 
Rrj{h') < 2Rrj{h*) — 16r]. The parameters in Theorem 4A are achieved by scaling down r] to ^ and 
the simple manipulation R^/Q{h) = Rrj{h) — 20qr]. 

Finally, we turn to analyze the algorithmic runtime. The spanner may be constructed in time 
(9(7^-0(ddim(A'))^ jQgj^-j^ Young's LP solver |You01| is invoked 0(log^log|) times, where the log ^ 

term is due to the binary search for L', and the log ^ term is due to the binary search for r. 
To determine the runtime per invocation, recall that each variable of the program appears in 
$d = \eta^{-O(\operatorname{ddim}(\mathcal{X}))}$ constraints, implying that there exist $m = \eta^{-O(\operatorname{ddim}(\mathcal{X}))} n$ total constraints. Since
we set $\beta = O(\eta^2 / \log n)$, we have that each call to the solver takes time $O(md(\log m)/\beta^2) \le$
^-0(ddim(A'))^^Qg2^^ for a total runtime of 7?-0(<^'^''^('^))n log^ n log ^ log i < 7?-^('i'i''^('^))nlog^ n. 
This completes the proof of Theorem 4.1 for q = 1. 



4.3 Solving the quadratic program 

Above, we considered the case when the loss function is linear. Here we modify the objective function construction to cover the case when the loss function is quadratic, that is $\frac{1}{n} \sum_i |y_i - f(x_i)|^2$. We then use the LP solver to solve our quadratic program. (Note that the spanner-edge construction above remains as before, and only the objective function construction is modified.)

Let us first redefine $w_i$ by the constraints

$$f(x_i) + w_i \le 1$$
$$f(x_i) + w_i \ge 1$$

It follows from the guarantees of the LP solver that in the returned solution, $1 - f(x_i) \le w_i \le 1 - f(x_i) + \beta$ and $w_i \ge 0$.

Now note that a quadratic inequality $v \ge x^2$ can be approximated for $x \in [0, 1]$ by a set of linear inequalities of the form

$$v \ge 2(j\eta)x - (j\eta)^2$$

for $0 \le j \le \frac{1}{\eta}$; these are just a collection of tangent lines to the quadratic function. Note that the slope of the quadratic function in the stipulated range is at most 2, so this approximation introduces an additive error of at most $2\eta$.
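A quick numerical sketch of this tangent-line approximation follows. The helper name `tangent_lower_envelope` and the choice $\eta = 0.05$ are illustrative assumptions; the code evaluates the largest of the tangent lines at $x$ and measures how far it falls below $x^2$.

```python
def tangent_lower_envelope(x, eta):
    """max over 0 <= j <= 1/eta of the tangent line 2 (j eta) x - (j eta)^2."""
    steps = round(1 / eta)
    return max(2 * (j * eta) * x - (j * eta) ** 2 for j in range(steps + 1))


eta = 0.05
xs = [k / 1000 for k in range(1001)]
# Each tangent satisfies x^2 - (2 a x - a^2) = (x - a)^2 >= 0, so the envelope
# never exceeds x^2; the gap is the squared distance to the nearest tangent point.
gap = max(x * x - tangent_lower_envelope(x, eta) for x in xs)
print("largest gap between x^2 and the tangent envelope:", gap)
```

The observed gap is well inside the $2\eta$ bound quoted in the text, which is a convenient (not tight) estimate derived from the slope bound.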
Since $|y_i - f(x_i)|$ takes values in the range $[0, 1]$, we will consider an equation set of the form

$$v_i \ge 2(j\eta)|y_i - f(x_i)| - (j\eta)^2 + 2\eta$$

which satisfies that the minimum feasible value of $v_i$ is in the range $\left[|y_i - f(x_i)|^2, |y_i - f(x_i)|^2 + 2\eta\right]$. It remains to model these difference constraints in the LP framework: When $f(x_i) \le y_i$, the equation set

$$v_i + 2(j\eta) f(x_i) \ge 2(j\eta) y_i - (j\eta)^2 + 2\eta$$

exactly models the above constraints. When $f(x_i) > y_i$, the lower bound of this set may not be tight, and instead the equation set

$$v_i + 2(j\eta) w_i \ge -2(j\eta) y_i - (j\eta)^2 + 2\eta + 2(j\eta)(1 + \beta)$$

models the above constraints, though possibly increasing the value of $v_i$ by $2(j\eta)\beta \le \eta$. (Note that when $f(x_i) \le y_i$, the lower bound of the second equation set may not be tight, so the first equation set is necessary. Also, note that whenever the right-hand side of an equation is negative, the equation is vacuous and may be omitted.)

The objective function is then replaced by the inequality

$$\frac{1}{n} \sum_i v_i \le r,$$

where $r$ is chosen by binary search as above.

Turning to the runtime analysis, the replacement of a constraint by $O(1/\eta)$ new constraints does not change the asymptotic runtime. For the analysis of the approximation error, first note that a solution to this program is a feasible solution to the original quadratic program. Further, given a solution to the original quadratic program, a feasible solution to the above program can be found by perturbing the quadratic program solution by at most $3\eta$ (since additive terms of $2\eta$ and $\eta$ are lost in the above construction). The proof of Theorem 4.1 for $q = 2$ follows by an appropriate scaling of $\eta$.

5 Approximate Lipschitz extension 

In this section, we show how to evaluate our hypothesis on a new point. More precisely, given a hypothesis function $f : S \to [0, 1]$, we wish to evaluate a minimum Lipschitz extension of $f$ on a new point $x \notin S$. That is, denoting $S = \{x_1, \ldots, x_n\}$, we wish to return a value $y = f(x)$ that minimizes $\max_i \frac{|y - f(x_i)|}{\rho(x, x_i)}$. Necessarily, this value is not greater than the Lipschitz constant of the classifier, meaning that the extension of $f$ to the new point does not increase the Lipschitz constant of $f$ and so Theorem 3.7 holds for the single new point. (By this local regression analysis, it is not necessary for newly evaluated points to have low Lipschitz constant with respect to each other, since Theorem 3.7 holds for each point individually.)

First note that the Lipschitz extension label $y$ of $x \notin S$ will be determined by two points of $S$. That is, there are two points $x_i, x_j \in S$, one with label greater than $y$ and one with a label less than $y$, such that the Lipschitz constant of $(x, y)$ relative to each of these points (that is, $\frac{f(x_i) - y}{\rho(x, x_i)} = \frac{y - f(x_j)}{\rho(x, x_j)}$) is maximum over the Lipschitz constant of $(x, y)$ relative to any point in $S$. Hence, $y$ cannot be increased or decreased without increasing the Lipschitz constant with respect to one of these points.
Note then that an exact Lipschitz extension may be derived in $O(n^2)$ time in brute-force fashion, by enumerating all point pairs in $S$, calculating the optimal Lipschitz extension for $x$ with respect to each pair alone, and then choosing the candidate value for $y$ with the highest Lipschitz constant. However, we demonstrate that an approximate solution to the Lipschitz extension problem can be derived more efficiently.
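The brute-force procedure can be sketched as below. The function names and the three-point example are illustrative assumptions. For a single pair, the balancing condition $\frac{f(x_i) - y}{\rho(x, x_i)} = \frac{y - f(x_j)}{\rho(x, x_j)}$ gives $y = \frac{f(x_i)\rho(x, x_j) + f(x_j)\rho(x, x_i)}{\rho(x, x_i) + \rho(x, x_j)}$ with constant $\frac{|f(x_i) - f(x_j)|}{\rho(x, x_i) + \rho(x, x_j)}$, and the pair with the largest constant determines the answer.

```python
import itertools


def exact_extension(labels, dists):
    """Brute-force minimizer of max_i |y - labels[i]| / dists[i] over y.

    labels[i] plays the role of f(x_i) and dists[i] of rho(x, x_i) > 0 for the
    new point x. Returns the pair (y, lipschitz_constant).
    """
    best_y, best_c = labels[0], 0.0
    for (fi, di), (fj, dj) in itertools.combinations(zip(labels, dists), 2):
        c = abs(fi - fj) / (di + dj)              # constant forced by this pair alone
        if c > best_c:
            best_c = c
            best_y = (fi * dj + fj * di) / (di + dj)   # value balancing the pair
    return best_y, best_c


def extension_cost(y, labels, dists):
    """Lipschitz constant of the candidate label y relative to all sample points."""
    return max(abs(y - f) / d for f, d in zip(labels, dists))


labels = [0.0, 1.0, 0.4]
dists = [1.0, 1.0, 0.5]
y, c = exact_extension(labels, dists)
print("extension value:", y, "Lipschitz constant:", c)
```

In this example the first two points are the determining pair: any other choice of $y$ would raise the constant relative to one of them.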

Theorem 5.1. An $\eta$-additive approximation to the Lipschitz extension problem can be computed in time $\eta^{-O(\operatorname{ddim}(\mathcal{X}))} \log n$.

Proof: The algorithm is as follows: Round up all labels $f(x_i)$ to the nearest term $j\eta/2$ (for any integer $0 \le j \le 2/\eta$), and call the new label function $\tilde{f}$. We seek the value of $\tilde{f}(x)$, the optimal Lipschitz extension value for $x$ for the new function $\tilde{f}$. Trivially, $f(x) \le \tilde{f}(x) \le f(x) + \eta/2$. Now, if we were given for each $j$ the point with label $j\eta/2$ that is the nearest neighbor of $x$ (among all points with this label), then we could run the brute-force algorithm described above on these $2/\eta$ points in time $O(\eta^{-2})$ and derive $\tilde{f}(x)$. However, exact metric nearest neighbor search is potentially expensive, and so we cannot find these points efficiently. We instead find for each $j$ a point $x' \in S$ with label $\tilde{f}(x') = j\eta/2$ that is a $(1 + \frac{\eta}{2})$-approximate nearest neighbor of $x$ among points with this label. (This can be done by presorting the points of $S$ into $2/\eta$ buckets based on their $\tilde{f}$ label, and once $x$ is received, running on each bucket a $(1 + \frac{\eta}{2})$-approximate nearest neighbor search algorithm due to [CG06] that takes $\eta^{-O(\operatorname{ddim}(\mathcal{X}))} \log n$ time.) We then run the brute-force algorithm on these $2/\eta$ points in time $O(\eta^{-2})$. The nearest neighbor search achieves approximation factor $1 + \frac{\eta}{2}$, implying a similar multiplicative approximation to $L$, and thus also to $|y - \tilde{f}(x')| \le 1$, which means at most $\eta/2$ additive error in the value $y$. We conclude that the algorithm's output solves the Lipschitz extension problem within additive approximation $\eta$. □
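The rounding-and-bucketing idea of the proof can be sketched as follows. This is only an illustration: the per-bucket nearest neighbor is found by an exact linear scan at query time, standing in for the presorted buckets and the approximate nearest neighbor structure of [CG06], so the sketch reproduces the rounding error but not the runtime. All names and the one-dimensional sample are assumptions.

```python
import itertools
import math
import random


def exact_extension(labels, dists):
    """Brute-force minimizer of max_i |y - labels[i]| / dists[i]; returns (y, constant)."""
    best_y, best_c = labels[0], 0.0
    for (fi, di), (fj, dj) in itertools.combinations(zip(labels, dists), 2):
        c = abs(fi - fj) / (di + dj)
        if c > best_c:
            best_c, best_y = c, (fi * dj + fj * di) / (di + dj)
    return best_y, best_c


def approx_extension(sample, labels, x, dist, eta):
    """Round labels up to multiples of eta/2, keep one nearest point per label bucket."""
    step = eta / 2
    nearest = {}                       # bucket index j -> distance from x to its nearest point
    for p, f in zip(sample, labels):
        j = math.ceil(f / step - 1e-12)        # small slack guards against float noise
        d = dist(x, p)                          # exact scan; stand-in for approximate NN search
        if d < nearest.get(j, math.inf):
            nearest[j] = d
    js = sorted(nearest)
    return exact_extension([j * step for j in js], [nearest[j] for j in js])[0]


random.seed(2)
eta = 0.1
sample = [random.random() for _ in range(60)]
labels = [0.5 + 0.5 * math.sin(3 * p) for p in sample]   # hypothetical labels in [0, 1]
dist = lambda a, b: abs(a - b)
x = 0.37
y_exact, _ = exact_extension(labels, [dist(x, p) for p in sample])
y_approx = approx_extension(sample, labels, x, dist, eta)
print("exact:", y_exact, "approximate:", y_approx)
```

With exact nearest neighbors per bucket, the only error is the label rounding, so the approximate value lies between the exact one and the exact one plus $\eta/2$; the approximate search in the proof contributes the remaining $\eta/2$.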



6 Strong consistency 

In previous sections, we defined an efficient regression algorithm and analyzed its finite-sample performance. Here we show that it enjoys the additional property of being strongly consistent. Note that our regression hypothesis is constructed via approximate nearest neighbors; see [DGKL94] for consistency results of nearest-neighbor regression functions in Euclidean spaces. We say that a regression estimator is strongly consistent if its expected risk converges almost surely to the optimal expected risk. Further, it is called universal if this rate of convergence does not depend on the sampling distribution $\mu$. In this section, we establish the strong, universal consistency of our regression estimate.

Theorem 6.1. Let $(\mathcal{X}, \rho)$ be a compact metric space and suppose $\mathcal{X} \times [0, 1]$ is endowed with a probability measure $\mu$. For $q \in \{1, 2\}$, suppose that there is a continuous $h^* : \mathcal{X} \to [0, 1]$ that achieves

$$R(h^*, q) = \inf_h R(h, q)$$

where $R(h, q)$ is the expected risk and the infimum is taken over all continuous $h : \mathcal{X} \to [0, 1]$. Then there exists a sequence $L_n$ increasing to $\infty$ such that

$$R(h_n, q) \to R(h^*, q)$$

almost surely, where $h_n$ is a minimizer of $R_n(h, q)$ over $h \in \mathcal{H}_{L_n}$.
Proof: Assume for now that $h^*$ is Lipschitz with constant $L^*$. Define

$$L_n = \log n$$

and pick any $\varepsilon > 0$.

Given a sample of $n$ pairs $(X_i, Y_i)$ drawn i.i.d. under $\mu$, let $h_n$ be a minimizer of

$$R_n(h, q) = \frac{1}{n} \sum_{i=1}^{n} |h(X_i) - Y_i|^q$$

over $\mathcal{H}_{L_n}$. (The infimum is achieved since $\mathcal{H}_{L_n}$ is compact by the Arzelà-Ascoli theorem.) Similarly, define $h_n^*$ to be a minimizer of

$$R(h, q) = \int |h(x) - y|^q \, \mu(dx, dy)$$

over $\mathcal{H}_{L_n}$.

Consider the event = {R{hn) > R{h*) + 2e}. By (||) and Theorem we have, forn > 



/ \ ^" log(24en/£) 

P(A) < 24nf=^j exp(-eV36) 



'24\ ('/+l)/2\ / r xddimW+l 

Dn=\l+[ — ] * ' 



where 



e y y V(e/24)(5+i)/ 

Our choice of $L_n$ ensures that $D_n = O((\log n)^{\operatorname{ddim}(\mathcal{X})+1})$. Now the series $\sum_n \Theta(n)^{\operatorname{polylog}(n)} \exp(-\Omega(n))$ converges by the $n$th root test, and so the Borel-Cantelli lemma implies that almost surely, we have

$$R(h_n) - R(h_n^*) \le 2\varepsilon$$

for all but finitely many $n$. Since $L_n \uparrow \infty$, there is an $n_0$ for which $L_{n_0} \ge L^*$. The inclusion $\mathcal{H}_{L^*} \subseteq \mathcal{H}_{L_n}$ for all $n \ge n_0$ implies that $R(h_n^*) = R(h^*)$ for all $n \ge n_0$. We conclude that $R(h_n) \to R(h^*)$ almost surely, as claimed.

The case where $h^*$ is continuous but not Lipschitz is handled by an approximation argument. It is easy to show that on a compact set, every continuous function is a uniform limit of Lipschitz functions. That is, there exists a sequence of $g_n \in \mathcal{H}_{L_n}$ such that

$$\|g_n - h^*\|_\infty = \sup_x |g_n(x) - h^*(x)| \to 0.$$

Pick an $\varepsilon > 0$, let $N = N(\varepsilon)$ be such that there is an $h_N^* \in \mathcal{H}_{L_N}$ with $\|h_N^* - h^*\|_\infty < \varepsilon$. The above argument shows that almost surely

$$R(h_n) - R(h_N^*) \le 2\varepsilon$$

holds for all but finitely many $n$.
Additionally,

$$
\begin{aligned}
|R(h_N^*) - R(h^*)| &= \left| \int_{\mathcal{X} \times [0,1]} |h_N^*(x) - y|^q \, \mu(dx, dy) - \int_{\mathcal{X} \times [0,1]} |h^*(x) - y|^q \, \mu(dx, dy) \right| \\
&\le \int_{\mathcal{X} \times [0,1]} \left| |h_N^*(x) - y|^q - |h^*(x) - y|^q \right| \mu(dx, dy) \\
&\le q \int_{\mathcal{X}} |h_N^*(x) - h^*(x)| \, \mu(dx) \\
&\le q\varepsilon
\end{aligned}
$$

where we invoked Lemma A.2 in the second inequality. This shows that $R(h_n) - R(h^*) \le (2 + q)\varepsilon$ almost surely for almost all $n$ and completes the proof.

□
A Technical lemmata



The following lemma is used in the proof of Theorem 3.1 
Lemma A.l. For all < ^ < r < r' < 1 — 'j we have 



27' 



3/2 



(23) 



Proof: Fix some 7 > 0. An elementary calculation shows that for any given 7, the l.h.s. of 



53) is minimized when r 



which again is a standard calculation. 



1 — 7. Thus it remains to show that 

27'/^ 



2' 



□ 



The following lemma is used in the proof of Corollary 3.5 and Theorem 6.1.

Lemma A.2. For $q \in \{1, 2\}$ and $a, a', b \in [0, 1]$, we have

$$\left| |a - b|^q - |a' - b|^q \right| \le q |a - a'|.$$

Proof: Consider the case $q = 1$. Then

$$\left| |a - b| - |a' - b| \right| \le |(a - b) - (a' - b)| = |a - a'|$$

which proves the claim for this case. For $q = 2$, recall that the Lipschitz constant of a differentiable real function is bounded by the maximal absolute value of its derivative. The function $f : [0, 1] \to [0, 1]$ defined by $f(a) = (a - b)^2$ for a fixed $b \in [0, 1]$ has $|f'(a)| \le 2$, which proves the claim for $q = 2$. □
B A small-hop spanner



In this section, we prove the following theorem. See Section g for the definition of a spanner. 

Theorem B.1. Every finite metric space $\mathcal{X}$ on $n$ points admits a $(1 + \delta)$-stretch spanner with degree $\delta^{-O(\operatorname{ddim}(\mathcal{X}))}$ (for $0 < \delta \le \frac{1}{2}$) and hop-diameter $O(\log n)$, that can be constructed in time $\delta^{-O(\operatorname{ddim}(\mathcal{X}))} n \log n$.

Gottlieb and Roditty [GR08] presented for general metrics a $(1 + \delta)$-stretch spanner with degree $\delta^{-O(\operatorname{ddim}(\mathcal{X}))}$ and construction time $\delta^{-O(\operatorname{ddim}(\mathcal{X}))} n \log n$, but this spanner has potentially large hop-diameter. Our goal is to modify this spanner to have low hop-diameter, without significantly increasing the spanner degree. Now, as described in [GR08], the points of $\mathcal{X}$ are arranged in a tree of degree $\delta^{-O(\operatorname{ddim}(\mathcal{X}))}$, and a spanner path is composed of three consecutive parts: (a) a path ascending the edges of the tree; (b) a single edge; and (c) a path descending the edges of the tree. We will show how to decrease the number of hops in parts (a) and (c). Below we will prove the following lemma.

Lemma B.2. Let $T$ be a tree containing directed child-parent edges ($n = |T|$), and let $p$ be the degree of $T$. Then $T$ may be augmented with directed descendant-ancestor edges to create a DAG $G$ with the following properties: (i) $G$ has degree $p + 3$; and (ii) the hop-distance in $G$ from any node to each of its ancestors is $O(\log n)$.



Note that Theorem B.1 is an immediate consequence of Lemma B.2 applied to the spanner of [GR08]. It remains only to prove Lemma B.2. We will first need a simple preliminary lemma:



Lemma B.3. Consider an ordered path on nodes $x_1, \ldots, x_n$. Let these nodes be assigned positive weights $w_i = w(x_i)$, and let the weight of the path be $W = \sum_{i=1}^{n} w(x_i)$. Then there exists a DAG $G$ on these nodes with the following properties:

1. Edges in $G$ always point to the antecedent node in the ordering.

2. The hop-distance from any node $x_i$ to the root node $x_1$ is not more than $O(\log \frac{W}{w_i})$.

3. The hop-distance from any node $x_i$ to an antecedent $x_j$ is not more than $O(\log \frac{W}{w_i} + \log \frac{W}{w_j})$.

4. $G$ has degree 3.

Proof: The construction is essentially the same as in the biased skip-lists of Bagchi et al. [BBG05]. Let $x_1$ and $x_n$ be the left and right end nodes of the path, and let the other nodes be the middle nodes. Partition the middle nodes into two child subpaths $\{x_2, \ldots, x_i\}$ (the left child path) and $\{x_{i+1}, \ldots, x_{n-1}\}$ (the right child path), where $x_i$ is chosen so that the weight of the middle nodes of each child path is not more than half the weight of the middle nodes of the parent path. (If the parent path has three middle nodes or fewer, then there will be a single child path.) The child paths are then recursively partitioned, until the recursion reaches paths with no middle nodes.

The edges are assigned as follows. A right end node of a path has two edges leaving it. One 
points to the left end node of the path (unless the path has only one node). The other edge points 
to the right end node of the right (or single) child path. A left end node of a path has one edge 
leaving it: If this path is a right child path, the edge points to the left sibling path's right end node. 
If this path is a left or single child path, then the edge points to the parent's left end node. The 
lemma follows via standard analysis. □ 
Given Lemma B.3, we can now prove Lemma B.2, from which Theorem B.1 immediately follows.

Proof: Given tree $T$, decompose $T$ into heavy paths: A heavy path is one that begins at the root and continues with the heaviest child, the child with the most descendants. In a heavy path decomposition, all off-path subtrees are recursively decomposed. For each heavy path, let the weight of each node in the path be the number of descendant nodes in its off-path subtrees. For each heavy path, we build the weighted construction of Lemma B.3.

Now, a path from node $u \in T$ to $v \in T$ traverses a set of at most $\lceil \log n \rceil$ heavy paths, say paths

Pi, . . . ,Pj. The number of hops from n to f is bounded by 0(log ^^7^ + (^J2i=i ^^S "l^K)^ ) 
log ^^;^) = 0(log n), and the degree of G is at most p + 3. □ 
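The heavy-path decomposition used in this proof is a standard technique, and a minimal sketch is given below. It covers only the decomposition, not the weighted shortcut construction of Lemma B.3. The parent-array representation, the assumption that every parent has a smaller index than its children, and the function names are all choices made for this illustration.

```python
import math
import random


def heavy_paths(parent):
    """Split a rooted tree into heavy paths.

    parent[v] is the parent of node v and parent[0] == -1 for the root; this
    sketch assumes every parent has a smaller index than its children.
    Returns head, where head[v] is the topmost node of the heavy path through v.
    """
    n = len(parent)
    size = [1] * n
    for v in range(n - 1, 0, -1):          # children are processed before parents
        size[parent[v]] += size[v]
    heavy = [-1] * n                        # heavy[p] = child of p with the largest subtree
    for v in range(1, n):
        p = parent[v]
        if heavy[p] == -1 or size[v] > size[heavy[p]]:
            heavy[p] = v
    head = list(range(n))
    for v in range(1, n):                   # parents are processed before children
        if heavy[parent[v]] == v:
            head[v] = head[parent[v]]
    return head


def paths_to_root(v, parent, head):
    """Number of distinct heavy paths met while walking from v up to the root."""
    count = 1
    while parent[head[v]] != -1:
        v = parent[head[v]]                 # jump over a light edge to the next path
        count += 1
    return count


random.seed(3)
n = 1000
parent = [-1] + [random.randrange(v) for v in range(1, n)]
head = heavy_paths(parent)
most = max(paths_to_root(v, parent, head) for v in range(n))
print("largest number of heavy paths on a node-to-root walk:", most)
```

Each jump between heavy paths crosses a light edge, below which the subtree size at most halves, so a node-to-root walk meets at most $\lfloor \log_2 n \rfloor + 1$ heavy paths; this is the logarithmic bound that the proof relies on.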